
    Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework

    We propose a self-supervised method to learn feature representations from videos. A standard approach in traditional self-supervised methods uses positive-negative data pairs trained with a contrastive learning strategy. In such a case, different modalities of the same video are treated as positives, and video clips from different videos are treated as negatives. Because spatio-temporal information is important for video representation, we extend the negative samples by introducing intra-negative samples, which are transformed from the same anchor video by breaking the temporal relations within video clips. With the proposed Inter-Intra Contrastive (IIC) framework, we can train spatio-temporal convolutional networks to learn video representations. There are many flexible options in our IIC framework, and we conduct experiments with several different configurations. Evaluations are conducted on video retrieval and video recognition tasks using the learned video representations. Our proposed IIC outperforms current state-of-the-art results by a large margin, such as improvements of 16.7 and 9.5 percentage points in top-1 accuracy on the UCF101 and HMDB51 datasets for video retrieval, respectively. For video recognition, improvements are also obtained on these two benchmark datasets. Code is available at https://github.com/BestJuly/Inter-intra-video-contrastive-learning.
    Comment: Accepted by ACMMM 2020. Our project page is at https://bestjuly.github.io/Inter-intra-video-contrastive-learning
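
    A minimal sketch of the core idea, assuming a frame-shuffling intra-negative and an InfoNCE-style loss; the function names and the exact loss form are illustrative, not the authors' implementation:

```python
# Sketch only: intra-negative construction plus a contrastive loss that uses
# both inter- and intra-negatives. Names and loss form are assumptions.
import torch
import torch.nn.functional as F

def make_intra_negative(clip: torch.Tensor) -> torch.Tensor:
    """Break the temporal order of a clip (C, T, H, W) by shuffling its frames."""
    perm = torch.randperm(clip.shape[1])
    return clip[:, perm]

def iic_style_loss(anchor, positive, intra_negative, inter_negatives, temperature=0.07):
    """Pull the anchor toward its positive (another modality of the same video)
    and push it away from inter- and intra-negatives.
    anchor, positive, intra_negative: L2-normalized vectors of shape (D,);
    inter_negatives: (N, D)."""
    pos = torch.exp(anchor @ positive / temperature)
    neg_intra = torch.exp(anchor @ intra_negative / temperature)
    neg_inter = torch.exp(anchor @ inter_negatives.t() / temperature).sum()
    return -torch.log(pos / (pos + neg_intra + neg_inter))

# Toy usage with random embeddings standing in for encoder outputs.
d = 128
z = lambda *shape: F.normalize(torch.randn(*shape), dim=-1)
loss = iic_style_loss(z(d), z(d), z(d), z(8, d))
```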

    Intent Identification and Entity Extraction for Healthcare Queries in Indic Languages

    Scarcity of data and technological limitations for resource-poor languages in developing countries like India pose a threat to the development of sophisticated NLU systems for healthcare. To assess the current status of various state-of-the-art language models in healthcare, this paper studies the problem by first proposing two healthcare datasets, Indian Healthcare Query Intent-WebMD and -1mg (IHQID-WebMD and IHQID-1mg), and one real-world Indian hospital query dataset, in English and multiple Indic languages (Hindi, Bengali, Tamil, Telugu, Marathi and Gujarati), annotated with query intents as well as entities. Our aim is to detect query intents and extract the corresponding entities. We perform extensive experiments with a set of models in various realistic settings and explore two scenarios based on access to English data only (less costly) and access to target-language data (more expensive). We analyze context-specific practical relevance through empirical analysis. The results, expressed in terms of overall F1 score, show that our approach is practically useful for identifying intents and entities.
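
    As a rough illustration of the "English data only" scenario described above, the sketch below shows the inference-time shape of a cross-lingual intent classifier built on a multilingual encoder; the model choice (xlm-roberta-base), the intent labels, and the Hindi query are assumptions for illustration, not details from the paper:

```python
# Sketch only: zero-shot cross-lingual intent classification. Fine-tuning on
# English-only data is omitted; labels and the example query are hypothetical.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

intents = ["symptom_query", "medicine_query", "appointment_query"]  # hypothetical
tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(intents)
)

# A multilingual encoder shares its subword vocabulary across languages, so a
# classifier fine-tuned on English queries can be applied to Indic queries.
query_hi = "बुखार के लिए कौन सी दवा लेनी चाहिए?"  # Hindi: which medicine should I take for fever?
inputs = tok(query_hi, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(intents[logits.argmax(dim=-1).item()])
```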

    MINOTAUR: Multi-task Video Grounding From Multimodal Queries

    Video understanding tasks take many forms, from action detection to visual query localization and spatio-temporal grounding of sentences. These tasks differ in the type of inputs (only video, or a video-query pair where the query is an image region or sentence) and outputs (temporal segments or spatio-temporal tubes). However, at their core they require the same fundamental understanding of the video, i.e., the actors and objects in it, their actions and interactions. So far these tasks have been tackled in isolation with individual, highly specialized architectures, which do not exploit the interplay between tasks. In contrast, in this paper, we present a single, unified model for tackling query-based video understanding in long-form videos. In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark, which entail queries of three different forms: given an egocentric video and a visual, textual or activity query, the goal is to determine when and where the answer can be seen within the video. Our model design is inspired by recent query-based approaches to spatio-temporal grounding, and contains modality-specific query encoders and task-specific sliding-window inference that allow multi-task training with diverse input modalities and different structured outputs. We exhaustively analyze relationships among the tasks and illustrate that cross-task learning leads to improved performance on each individual task, as well as the ability to generalize to unseen tasks, such as zero-shot spatial localization of language queries.
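
    The sketch below illustrates, under assumptions, what modality-specific query encoders mapping visual, textual, and activity queries into a shared embedding space could look like; the layer choices, dimensions, and class name are hypothetical and not the paper's architecture:

```python
# Sketch only: three modality-specific encoders projecting into one query
# embedding space that a single grounding head could consume downstream.
import torch
import torch.nn as nn

class QueryEncoders(nn.Module):
    def __init__(self, dim=256, text_vocab=30522, num_activities=110):
        super().__init__()
        self.visual = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))  # image-region crop
        self.text = nn.EmbeddingBag(text_vocab, dim)                   # token ids
        self.activity = nn.Embedding(num_activities, dim)              # activity label id

    def forward(self, query, modality: str) -> torch.Tensor:
        if modality == "visual":
            return self.visual(query)
        if modality == "text":
            return self.text(query)
        return self.activity(query)

enc = QueryEncoders()
q_vis = enc(torch.randn(1, 3, 64, 64), "visual")             # (1, 256)
q_txt = enc(torch.tensor([[101, 2054, 2003, 102]]), "text")  # (1, 256)
q_act = enc(torch.tensor([7]), "activity")                   # (1, 256)
```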

    Beyond Simple Meta-Learning: Multi-Purpose Models for Multi-Domain, Active and Continual Few-Shot Learning

    Modern deep learning requires large-scale, extensively labelled datasets for training. Few-shot learning aims to alleviate this issue by learning effectively from few labelled examples. In previously proposed few-shot visual classifiers, it is assumed that the feature manifold, where classifier decisions are made, has uncorrelated feature dimensions and uniform feature variance. In this work, we focus on addressing the limitations arising from this assumption by proposing a variance-sensitive class of models that operates in a low-label regime. The first method, Simple CNAPS, employs a hierarchically regularized Mahalanobis-distance based classifier combined with a state-of-the-art neural adaptive feature extractor to achieve strong performance on the Meta-Dataset, mini-ImageNet and tiered-ImageNet benchmarks. We further extend this approach to a transductive learning setting, proposing Transductive CNAPS. This transductive method combines a soft k-means parameter refinement procedure with a two-step task encoder to achieve improved test-time classification accuracy using unlabelled data. Transductive CNAPS achieves state-of-the-art performance on Meta-Dataset. Finally, we explore the use of our methods (Simple and Transductive) for "out of the box" continual and active learning. Extensive experiments on large-scale benchmarks illustrate the robustness and versatility of this relatively simple class of models. All trained model checkpoints and corresponding source code have been made publicly available.
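
    A minimal sketch of a Mahalanobis-distance classifier with class covariances shrunk toward a task-level covariance, in the spirit of the hierarchical regularization described above; the shrinkage weight and function signature are assumptions, not the exact Simple CNAPS recipe:

```python
# Sketch only: classify query features by Mahalanobis distance to class means,
# blending each class covariance with a task-level covariance.
import torch

def mahalanobis_classifier(support, support_labels, query, num_classes, eps=1e-3):
    """support: (N, D) features, support_labels: (N,), query: (M, D).
    Returns (M, num_classes) logits = negative squared Mahalanobis distances."""
    d = support.shape[1]
    task_cov = torch.cov(support.t()) + eps * torch.eye(d)
    logits = []
    for c in range(num_classes):
        feats = support[support_labels == c]
        lam = feats.shape[0] / (feats.shape[0] + 1)  # assumed shrinkage weight
        class_cov = torch.cov(feats.t()) if feats.shape[0] > 1 else torch.zeros(d, d)
        cov = lam * class_cov + (1 - lam) * task_cov + eps * torch.eye(d)
        diff = query - feats.mean(dim=0)
        dist = (diff @ torch.linalg.inv(cov) * diff).sum(dim=1)
        logits.append(-dist)
    return torch.stack(logits, dim=1)

# Toy example: 2 classes, 5-dim features.
sup = torch.randn(10, 5)
lab = torch.tensor([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
print(mahalanobis_classifier(sup, lab, torch.randn(3, 5), 2).shape)  # (3, 2)
```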

    Evaluation of Interactive Rhythm Activities on the Engagement Level of Individuals with Memory Impairments

    Alzheimer's dementia can lead to a decreased quality of life in patients through the manifestation of inappropriate behavioral and psychological signs and symptoms. Music therapy has been shown to decrease agitation and disruptive behaviors in patients with dementia, although improvement in overall cognitive function was minimal. However, there is evidence of an increase in grey matter in those who actively participate in music activities. Our goal in this study is to focus on how participation in rhythm-based activities affects quality of life.